{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 3.3 Uses of the standard Normal distribution\n",
    "\n",
    "Suppose we wanted to answer the following question: \n",
    "\n",
    "> What is the probability of having a 'healthy' weight?\n",
    "\n",
    "A healthy weight is often is often measured using the Body Mass Index (BMI - although see [here](https://www.health.harvard.edu/blog/how-useful-is-the-body-mass-index-bmi-201603309339) and [here](https://www.bbc.co.uk/news/health-43895508) for a discussion on why this may be too simplistic a measure). An individual's BMI can be calculated using their height and weight, using the formula BMI $= \\frac{mass(kg)}{height(m)^2}$. Then people can be classified as:\n",
    "\n",
    "| Classification | BMI | \n",
    "|:-|:-|\n",
    "| Underweight |<18.5 |\n",
    "| Normal | 18.5-24.9 |\n",
    "| Overweight | 25-29.9 |\n",
    "| Obese | 30-39.9 |\n",
    "| Extremely obese | >40 |\n",
    "\n",
    "To address our question, we will use data taken from a study undertaken among a group of 76 cleaners, that investigated whether telling the cleaners they had an active lifestyle influenced their BMI. We will assume that values of BMI approximately follow a normal distribution. We do not know the true values of $\\mu$ and $\\sigma$ so we will replace these with the sample mean and standard deviation. This gives us values of $\\mu=$26.5 and $\\sigma^2=$ 18.1, as demonstrated in the snippet of code below.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<table>\n",
       "<thead><tr><th scope=col>Cond</th><th scope=col>Age</th><th scope=col>Wt</th><th scope=col>Wt2</th><th scope=col>BMI</th><th scope=col>BMI2</th><th scope=col>Fat</th><th scope=col>Fat2</th><th scope=col>WHR</th><th scope=col>WHR2</th><th scope=col>Syst</th><th scope=col>Syst2</th><th scope=col>Diast</th><th scope=col>Diast2</th></tr></thead>\n",
       "<tbody>\n",
       "\t<tr><td>0    </td><td>43   </td><td>137  </td><td>137.4</td><td>25.1 </td><td>25.1 </td><td>31.9 </td><td>32.8 </td><td>0.79 </td><td>0.79 </td><td>124  </td><td>118  </td><td>70   </td><td>73   </td></tr>\n",
       "\t<tr><td>0    </td><td>42   </td><td>150  </td><td>147.0</td><td>29.3 </td><td>28.7 </td><td>35.5 </td><td>  NA </td><td>0.81 </td><td>0.81 </td><td>119  </td><td>112  </td><td>80   </td><td>68   </td></tr>\n",
       "\t<tr><td>0    </td><td>41   </td><td>124  </td><td>124.8</td><td>26.9 </td><td>27.0 </td><td>35.1 </td><td>  NA </td><td>0.84 </td><td>0.84 </td><td>108  </td><td>107  </td><td>59   </td><td>65   </td></tr>\n",
       "\t<tr><td>0    </td><td>40   </td><td>173  </td><td>171.4</td><td>32.8 </td><td>32.4 </td><td>41.9 </td><td>42.4 </td><td>1.00 </td><td>1.00 </td><td>116  </td><td>126  </td><td>71   </td><td>79   </td></tr>\n",
       "\t<tr><td>0    </td><td>33   </td><td>163  </td><td>160.2</td><td>37.9 </td><td>37.2 </td><td>41.7 </td><td>  NA </td><td>0.86 </td><td>0.84 </td><td>113  </td><td>114  </td><td>73   </td><td>78   </td></tr>\n",
       "\t<tr><td>0    </td><td>24   </td><td> 90  </td><td> 91.8</td><td>16.5 </td><td>16.8 </td><td>  NA </td><td>  NA </td><td>0.73 </td><td>0.73 </td><td> NA  </td><td> NA  </td><td>78   </td><td>76   </td></tr>\n",
       "</tbody>\n",
       "</table>\n"
      ],
      "text/latex": [
       "\\begin{tabular}{r|llllllllllllll}\n",
       " Cond & Age & Wt & Wt2 & BMI & BMI2 & Fat & Fat2 & WHR & WHR2 & Syst & Syst2 & Diast & Diast2\\\\\n",
       "\\hline\n",
       "\t 0     & 43    & 137   & 137.4 & 25.1  & 25.1  & 31.9  & 32.8  & 0.79  & 0.79  & 124   & 118   & 70    & 73   \\\\\n",
       "\t 0     & 42    & 150   & 147.0 & 29.3  & 28.7  & 35.5  &   NA  & 0.81  & 0.81  & 119   & 112   & 80    & 68   \\\\\n",
       "\t 0     & 41    & 124   & 124.8 & 26.9  & 27.0  & 35.1  &   NA  & 0.84  & 0.84  & 108   & 107   & 59    & 65   \\\\\n",
       "\t 0     & 40    & 173   & 171.4 & 32.8  & 32.4  & 41.9  & 42.4  & 1.00  & 1.00  & 116   & 126   & 71    & 79   \\\\\n",
       "\t 0     & 33    & 163   & 160.2 & 37.9  & 37.2  & 41.7  &   NA  & 0.86  & 0.84  & 113   & 114   & 73    & 78   \\\\\n",
       "\t 0     & 24    &  90   &  91.8 & 16.5  & 16.8  &   NA  &   NA  & 0.73  & 0.73  &  NA   &  NA   & 78    & 76   \\\\\n",
       "\\end{tabular}\n"
      ],
      "text/markdown": [
       "\n",
       "| Cond | Age | Wt | Wt2 | BMI | BMI2 | Fat | Fat2 | WHR | WHR2 | Syst | Syst2 | Diast | Diast2 |\n",
       "|---|---|---|---|---|---|---|---|---|---|---|---|---|---|\n",
       "| 0     | 43    | 137   | 137.4 | 25.1  | 25.1  | 31.9  | 32.8  | 0.79  | 0.79  | 124   | 118   | 70    | 73    |\n",
       "| 0     | 42    | 150   | 147.0 | 29.3  | 28.7  | 35.5  |   NA  | 0.81  | 0.81  | 119   | 112   | 80    | 68    |\n",
       "| 0     | 41    | 124   | 124.8 | 26.9  | 27.0  | 35.1  |   NA  | 0.84  | 0.84  | 108   | 107   | 59    | 65    |\n",
       "| 0     | 40    | 173   | 171.4 | 32.8  | 32.4  | 41.9  | 42.4  | 1.00  | 1.00  | 116   | 126   | 71    | 79    |\n",
       "| 0     | 33    | 163   | 160.2 | 37.9  | 37.2  | 41.7  |   NA  | 0.86  | 0.84  | 113   | 114   | 73    | 78    |\n",
       "| 0     | 24    |  90   |  91.8 | 16.5  | 16.8  |   NA  |   NA  | 0.73  | 0.73  |  NA   |  NA   | 78    | 76    |\n",
       "\n"
      ],
      "text/plain": [
       "  Cond Age Wt  Wt2   BMI  BMI2 Fat  Fat2 WHR  WHR2 Syst Syst2 Diast Diast2\n",
       "1 0    43  137 137.4 25.1 25.1 31.9 32.8 0.79 0.79 124  118   70    73    \n",
       "2 0    42  150 147.0 29.3 28.7 35.5   NA 0.81 0.81 119  112   80    68    \n",
       "3 0    41  124 124.8 26.9 27.0 35.1   NA 0.84 0.84 108  107   59    65    \n",
       "4 0    40  173 171.4 32.8 32.4 41.9 42.4 1.00 1.00 116  126   71    79    \n",
       "5 0    33  163 160.2 37.9 37.2 41.7   NA 0.86 0.84 113  114   73    78    \n",
       "6 0    24   90  91.8 16.5 16.8   NA   NA 0.73 0.73  NA   NA   78    76    "
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[1] \"value of mu is 26.46\"\n",
      "[1] \"value of sigma is 4.25\"\n"
     ]
    }
   ],
   "source": [
    "# BMI dataset\n",
    "dat <- read.csv(\"Practicals/Datasets/BMI/MindsetMatters.csv\")\n",
    "head(dat)\n",
    "#remove observations with no BMI data\n",
    "dat <- dat[!is.na(dat$BMI),]\n",
    "#estimate mu and sigma\n",
    "mu <- mean(dat$BMI)\n",
    "print(paste0(\"value of mu is \",round(mu,2)))\n",
    "sig <- sd(dat$BMI)\n",
    "print(paste0(\"value of sigma is \",round(sig,2)))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "So what is the probability a randomly selected person in this sample has a normal BMI? \n",
    "\n",
    "#### Approach 1: Manual calculation\n",
    "One option is to make use of pre-calculated probabilities of the standard normal distribution. If we write $X$ to represent the value of a person's BMI, then we are assuming that\n",
    "\n",
    "$$\n",
    "X \\sim N(\\mu=26.5, \\sigma^2=18.1)\n",
    "$$\n",
    "\n",
    "To make use of the pre-calculated probabilities for the standard normal distribution, we must first transform our normally distributed variable to have a standard normal distribution. We know that the transformed variable $Z$ (the *Z score*) has a standard normal distribution, where\n",
    "\n",
    "$$\n",
    "Z = \\frac{X - \\mu}{\\sigma}\n",
    "$$\n",
    "\n",
    "Given values for $\\mu$ and $\\sigma$ we can go from the *X scale* to the *Z scale* and *vice versa*. The important point about describing a distribution on the Z scale is that this opens the ability to calculate specific probabilities. So back to answering the question...\n",
    "\n",
    "From the table above we can see that a normal weight is classified as a BMI between 18.5 and 24.9, and we want to know what the probability is that a randomly selected person falls between these limits. We write this as;\n",
    "\n",
    "$$\n",
    "P(18.5 < X < 24.9) \n",
    "$$\n",
    "\n",
    "On the Z-scale, this is equivalent to saying that:\n",
    "\n",
    "$$\n",
    "P(-1.87 < Z < -0.37) = P(Z < -0.37) - P(Z < -1.87)\n",
    "$$\n",
    "\n",
    "Tables exist containing a range of pre-calculated probabilities that a variable following a standard normal distribution takes a value of less than $k$, for a range of possible values of $k$. These are often called *z-tables* (found [online](http://www.z-table.com/) or at the back of most stats books). From these tables, we can look up the corresponding probability for each z-score, giving:\n",
    "\n",
    "$$\n",
    "0.3557 - 0.0307 = 0.325\n",
    "$$"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Approach 2: Using R to do the same calculation\n",
    "\n",
    "Using this approach, R is ultimately using the same pre-calculated probability tables. However, it is considerably quicker and easier to ask R to look up the values rather than finding them in tables."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[1] \"z_max is -0.37 and z_min is -1.87\"\n",
      "[1] \"Probability of having a healthy BMI is (z-score) 0.326\"\n"
     ]
    }
   ],
   "source": [
    "# a) if we were to use Z tables within R (to illustrate the point)\n",
    "\n",
    "z_min <- (18.5-mu)/sig\n",
    "z_max <- (24.9-mu)/sig\n",
    "\n",
    "# note when using pnorm we don't need to specify mu and sigma as the \n",
    "# function assumes mu=0 and sigma=1 unless specified.\n",
    "print(paste0(\"z_max is \",round(z_max,2),\" and z_min is \",round(z_min,2)))\n",
    "print(paste0(\"Probability of having a healthy BMI is (z-score) \",round(pnorm(z_max)-pnorm(z_min),3)))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Approach 3: Using R to do the  calculation on the untransformed scale"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[1] \"Probability of having a healthy BMI is (direct) 0.326\"\n"
     ]
    }
   ],
   "source": [
    "# b) if we were to directly estimate\n",
    " \n",
    "print(paste0(\"Probability of having a healthy BMI is (direct) \",round(pnorm(24.9,mu,sig)-pnorm(18.5,mu,sig),3)))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Calculating directly gives the same result as using a z-score in R, and this returns the same information as using z-tables.\n",
    "\n",
    "In answer to our question, we estimate that the probability of having a 'healthy' weight is 32.6\\%. We can compare this to the observed proportion of our sample of data with a 'healthy' BMI."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[1] \"Within the data a healthy BMI is seen 35.1%\"\n"
     ]
    },
    {
     "data": {
      "image/png": "iVBORw0KGgoAAAANSUhEUgAAAeAAAAFoCAMAAAC46dgSAAAAPFBMVEUAAAAzMzNGgrRNTU1o\naGh8fHyMjIyampqnp6eysrK9vb3Hx8fMzMzQ0NDZ2dnh4eHp6enr6+vw8PD////IVY4uAAAA\nCXBIWXMAABJ0AAASdAHeZh94AAAJkUlEQVR4nO3d6ULiShCG4daoMI6jHLn/ez2ELYvptbqw\nU3m/H2Owy0rJM4GIS9yRmI777QGIbgA2HoCNB2DjAdh4ADYegI0HYOORAP93yvmfpPyodMmV\nGU09cZJBK1Q+evcAV+gJcGQcgPUqAa7QE+DIOADrVQJcoSfAkXEA1qsEuEJPgCPjAKxXCXCF\nngBHxgFYrxLgCj0BjowDsF6lGeCPcQJ1AK8V+HkIwKMlgP0BGODsSoATB58HYM8SwP4ADHB2\nJcCJg88DsGepCrAwNXY9AVbe1zrDESzYvfUjWDgOwHqVAAcCMMDZlQAnDj4PwJ4lgP0BGODs\nSoATB58HYM8SwP4ADHB2JcCJg88DsGcJYH8ABji7EuDEwecB2LMEsD8AA5xdCXDi4PMA7FkC\n2B+AAc6uBDhx8HkA9iwB7A/AqwaO/z4iwOsGjn4EwADnDwpw2uDzAOxZAhhggKMBODgOwHqV\nAAMMcDwAB8cBWK8SYIBv6c4Z3wC4YNB2gS+ws7cA5w7aNnD3YwPgzEHXATz1BdgK8HAA35+C\nn/qk/N8I5Hf+CEvqR1hJHvDkFkewuSN4dks4DsB6lVnAne+mcByA9SoLgXmILh50PcCj41k4\nDsB6lSXAZ93xC1kAGwH2RjgOwHqVAAMMcDwAB8cBWK8SYIABjgfg4DgA61UCDDDA8QAcHAdg\nvUqAAQY4HoCD4wCsVwkwwADHA3BwHID1KgEGGOB4AA6OA7BeJcAAAxwPwMFxANarBBhggOMB\nODgOwHqVAAMMcDwAB8cBWK8SYIABjsc6sDD8EZZHhCNYsHvrR7BwHID1KgEGGOB4AA6OA7Be\nJcAAAxwPwMFxGgBOi7crwMFxGgB+9tyYLni7AhwcB2C9SoABBhhg4TgA61UCDDDAAAvHAViv\nEmCAAQZYOA7AepUAAwwwwMJxANarBBhggAEWjgOwXiXAAAMMsHAcgPUqAQYYYICF4wCsV+kF\ndtfb3eyy3wALB20CuHOjAGwP+H3k+z6wj64Kbe0C0RsDPg4P0ePjerY53BaOA7BepRd4IQB7\nu64ReN/NnoPHZ1sAFw/aCvD+x0nW5Cn4/s9Tn+jBH07hV2jT3/wbQyTtqwC4bM5GMruXu9HZ\n1fU9938aOYL9KoEP4gi+3Vw+rAAuGLSwVBf4zX0DbBn40L0eFmwBLhi0sFT7IfrHSdbIFOCc\nQQtLHw18f/WqG20DnDBoYenmv5sEcEIlwFsFXsF3kwBOqAR4q8CXHF7/JPgCvFrg47dLERaO\nA7BeZQzY95IlwBmDFpY+BPiva/dnsgBOqPQC38+x9gBbBu5SfAFeIXBWhOMArFcJ8GaBv/cv\nzr3sF78rDHDWoIWlusCH68/cdbPvCgOcP2hhqS7wzvXf8D+8uh3AJoFvL3DwQgfAAK8RmIdo\n48CcZBkH5ssk68A5EY4DsF4lwJsFfju/w73wHGwTeH/5+sit/ix6dvHBBoF9l0fUBe7cZ//m\na/VfB88WWgT27IIXOgCOV3qB39zuu/9ayb0CbBL4/kLHF8AmgW8vdKScRAO8RuCcCMcBuHVg\nYUr/CItXxVf10eQfYfGOXjccwXHg9E/Jn1UewcJxAAZ4KQAnVAIMMMAAC8cBGOClAJxQCTDA\nAAMsHAdggJcCcEIlwAADDLBwHIABXgrACZUAAwwwwMJxAAZ4KQAnVAIMMMAAC8cBGOClFAB/\nfLjlqx2mAXt+DRDgyDiPA35+dgmOSQsZg0ZKAfYE4IRKgAEGGGDhOAADvBSAEyoBBhhggIXj\nAAzwUgBOqMwBHl8UuptcIVo4DsBNAE8u6z69HppwHIABXgrACZUZwGPY2QUNheMA3B7w/Sn4\nqU/Sh/rzuL+yUxFY+Dn7JqzXdp6ke3nyCM1JVuKgkdKWjuBu+YZwHIBbAe48t4TjANwIcDfd\nAjhx0EhpM8DddHN0UzgOwE0Ad7dT5+44fVULYBvAgQjHARjgpQCcUAkwwAADLBwHYICXAnBC\nJcAAAwywcJyfwB9J8d75vqqawN5kY/jbFt+jo6VGgWV3fmChGrC3KBujuFWg57AEsHwfAEfG\nAbisVaDnsASwfB8AR8YBuKxVoOewBLB8HwBHxgG4rFWg57AEsHwfAEfGAbisVaDnsASwfB8A\nR8YBuKxVoOewBLB8HwBHxgG4rFWg57AEsHwfAEfGAbisVaDnsASwfB8AR8YBuKxVoOewBLB8\nHwBHxgG4rFWg57AEsHwfAEfGAbisVaDnsFQFWJifu14zcPanX7FVKBzBxQv2j+DITmLjAFzW\nKtBzWAJYvg+AI+MAXNYq0HNYAli+D4Aj4wBc1irQc1gCWL6PLQKHfktuO8BpvyRZ0CrxntYE\n9g26LWBR20CrxHsa4Dr7ABjgsv0BnL8AMMAAAwwwwAADXN4KYO8CwAADDDDAAAMMcHkrgL0L\nAAMMMMAAAwwwwOWtAPYuAAwwwAADDDDAAJe3Ati7ALA/42sGx64fDPD6gMeXdY9e4h1ggAEu\nbwWwdwHgTOCnPj+LE3/t6hKXVS3LI/f1Cwn41T2CQ0k4gn2VGU09cZJBK1Q+evcAV+gJcGQc\ngPUqAa7QE+DIOADrVWYA31+96kbbAGdWtgzsj3AcgPUqAa7QE+DIOADrVQJcoSfAkXEA1qsE\nuEJPgCPjAKxXWQe4z8K3lMTR6LnZQQFurCfADTVdw6AAN9azNWDSeAA2HoCNB2DjAdh4ADYe\nCfDlhzu66Y95iHL/6ZF6LadNNzeoBPg6RL1P8P4zX7Of/arTdJODSoC7I8DND1rhIbrqNNeG\n9Rt3mx1UDlz1aeioeL9tc9A6R3DlR7/q91v9pheDyoNqNJWfRU+3xFEDnm7UaapyBE/eitMa\nsNqjwnyrStctAFd+5Js1brSp6ll0Yw/RVc/ru1nj2k2bthg3begkq+aLOd3tBLLmeaRa0/Hb\ndpvyWrTxAGw8ABsPwMYDsPEAbDwAGw/AxgOw8WwY2F2y+zxtd2/vh/M7D+9vXb/0u6NVjJ3P\nJDvuln/n7d35nTvX4wJsIVfFP+6l3365vrL+ArCV3BQvon9c/1D9eXoLsJHcj+B9v32iPd/4\nBNhKbk/Bu/P26cH59PbFHQG2khvw23n7dHp1OB5O2gBbyVXx7+Uh+vjPvR/f3V+AzeR+knX5\nwvfbvR5f3TfAZjI9iz72uidjgM3kovi975+E++1399afSQNsJfdXsr4uoodhE2ALueh2u6/j\nVbTrn40BJqsKwMYDsPEAbDwAGw/AxgOw8QBsPAAbD8DGA7DxAGw8/wOIgTQYMSd6RgAAAABJ\nRU5ErkJggg==",
      "text/plain": [
       "plot without title"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "# c) provide a sanity check against the data\n",
    "options(repr.plot.width=4, repr.plot.height=3)\n",
    "library(ggplot2)\n",
    "\n",
    "ggplot(dat,aes(x=BMI)) + geom_histogram(bins = 30,fill=\"steelblue\",col=\"grey80\") +\n",
    "    geom_vline(xintercept = c(18.5,24.9))\n",
    "#hist(dat$BMI,col=\"steelblue\")\n",
    "#abline(v=c(18.5,24.9),lty=2)\n",
    "print(paste0(\"Within the data a healthy BMI is seen \",\n",
    "             round(100*((sum(dat$BMI<24.9)-sum(dat$BMI<18.5))/length(dat$BMI)),1),\"%\"))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "So we can see that the sample estimate (35.1%) is roughly similar to the population estimate of 32.6%. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "R",
   "language": "R",
   "name": "ir"
  },
  "language_info": {
   "codemirror_mode": "r",
   "file_extension": ".r",
   "mimetype": "text/x-r-source",
   "name": "R",
   "pygments_lexer": "r",
   "version": "3.6.1"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}